Fix and rework GPT-TF.js #807
base: develop
Conversation
Force-pushed from 1d88d35 to 03e5c7d
superb work, thanks for clearing up the GPT's mud, every comment makes it more understandable!
yeah, sadly, as I forgot to merge the processing PR (#781) before you branched off, the whole processing pipeline changed a lot. sorry for stepping on your toes (hopefully, it will simplify this PR).
btw, it seems that @xenova/transformers has recently been updated to @huggingface/transformers. did you try it out? maybe it'll help with the tokenizer usage (doesn't look much changed to me but you know best)
discojs-node/src/loaders/text.ts
Outdated
const tokens = models.tokenize(tokenizer, endOfPreviousChunk + chunk, {
  padding: false,
  truncation: false,
  return_tensor: false,
})
if (tokens.length < blockSize + 1) {
  // throw if it happens on the 1st iteration
  if (iteration === 0)
    throw new Error(`the chunk (${tokens.length} tokens) is too small ` +
      `to get a sequence of length blockSize (${blockSize + 1} tokens). ` +
      `Either the text file or the chunk size (${chunkBitSize} bits) is too small.`);
  // if this isn't the first iteration we simply skip
  // as we expect the last chunk to be potentially smaller than the block size
  debug("chunk smaller than block size, loading next chunk")
  continue
}
debug("batch per chunk: %o", tokens.length / (batchSize * blockSize))
let currentPosition = 0;
// yield one block of tokens at a time
while (currentPosition + blockSize + 1 <= tokens.length) {
  yield tokens.slice(currentPosition, currentPosition + blockSize + 1);
  currentPosition += blockSize; // don't add + 1 here
}
// keep the last tokens for the next chunk
// if this was the last one the remaining tokens are discarded
if (currentPosition < tokens.length) {
  // We actually need to decode the tokens to get the leftover text
  // instead of simply keeping the remaining tokens.
  // This is because the tokens may be different once prepended to the next chunk,
  // e.g. if the remaining text is ". A" and the next chunk starts with "nother",
  // the tokenization will be different than if we simply concatenate the remaining tokens.
  endOfPreviousChunk = tokenizer.decode(
    tokens.slice(currentPosition),
    { skip_special_tokens: true }
  )
  debug("End of chunk, remaining text: '%s'", endOfPreviousChunk)
} else {
  // Note that the difference between tokenizing and then concatenating
  // vs concatenating and then tokenizing can happen if there is no
  // remaining text. We consider this difference negligible.
  endOfPreviousChunk = "";
}
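As a side note, here is a minimal sketch of the boundary effect described in the comments above, i.e. why the leftover is decoded back to text and re-tokenized together with the next chunk. It assumes the @xenova/transformers tokenizer API and the "Xenova/gpt2" checkpoint purely for illustration; exact token ids depend on the tokenizer used.

import { AutoTokenizer } from "@xenova/transformers";

const tokenizer = await AutoTokenizer.from_pretrained("Xenova/gpt2");

const leftover = ". A";               // text left over from the previous chunk
const nextChunk = "nother sentence."; // start of the next chunk

// tokenizing the two pieces separately and concatenating the ids...
const separate = [...tokenizer.encode(leftover), ...tokenizer.encode(nextChunk)];
// ...generally differs from tokenizing the concatenated text,
// because "Another" can be merged into different subwords at the boundary
const together = tokenizer.encode(leftover + nextChunk);

console.log(separate, together); // expect different ids around the chunk boundary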
duplicated in discojs-web/loaders/text. that hints to me that it shouldn't happen in the loader but should be applied after.
the issue at hand is that lines were outputted by the previous version. I think we can change it to output characters (single-letter strings) instead. that would also drop the blockSize, batchSize & minChunkSize arguments, which aren't really relevant for reading text (separation of concerns and all that).
in the newly merged processing PR (#781), it is much simpler to combine such transformations. I think something like
loadText($path).batch($blockSize).map((block) => tokenize(block, $tokenizer))
with tokenize updated to accept a block / List<string> instead, and maybe drop the padding (but what would be the behavior at the end of the file?)
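For illustration, a minimal sketch of what such an overload could look like. The name tokenizeBlock is hypothetical, and it assumes an Immutable.js List of single-character strings (as produced by loadText($path).batch($blockSize)) and the @xenova/transformers tokenizer; it is not the actual Disco API.

import { List } from "immutable";
import type { PreTrainedTokenizer } from "@xenova/transformers";

// hypothetical overload: accept a block of single-character strings
function tokenizeBlock(
  block: List<string>,
  tokenizer: PreTrainedTokenizer,
): List<number> {
  // reassemble the characters into text before tokenizing,
  // so the tokenizer sees whole words rather than isolated letters
  const text = block.join("");
  return List(tokenizer.encode(text));
}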
Alright, the current implementation ended up being much simpler than my original tokenizing text loaders:
loadText('../datasets/wikitext/wiki.train.tokens')
  .map(text => processing.tokenize(tokenizer, text)) // tokenize the whole chunk
  .flat() // (I renamed unbatch to flat)
  .batchWithOverlap(config.blockSize) // one token overlap between each batch
  .map((tokens) => [tokens.pop(), tokens.last()] as [List<number>, number])
  .batch(batchSize);
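For intuition, a standalone sketch of the one-token overlap that batchWithOverlap provides, using a plain array and a hypothetical helper rather than the actual Dataset API: each window holds blockSize + 1 tokens and starts on the last token of the previous window, so the pipeline above can split it into blockSize input tokens plus the following token as the label.

// hypothetical helper mirroring the one-token overlap between windows
function windowsWithOverlap(tokens: number[], blockSize: number): number[][] {
  const windows: number[][] = [];
  // each window spans blockSize + 1 tokens; advancing by blockSize
  // makes consecutive windows share exactly one token
  for (let start = 0; start + blockSize + 1 <= tokens.length; start += blockSize) {
    windows.push(tokens.slice(start, start + blockSize + 1));
  }
  return windows;
}

// e.g. with blockSize = 4:
// windowsWithOverlap([0, 1, 2, 3, 4, 5, 6, 7, 8], 4)
// -> [[0, 1, 2, 3, 4], [4, 5, 6, 7, 8]]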
Force-pushed from 0a28b81 to f7f96dc
maybe rename block size to context len, that would be more specific
…odeling head and attention bias
… implement topk sampling
…ers following GPT2 convention, use LogLayer
… and language modeling head
…red task parameter for text tasks
…he model init config
Force-pushed from f7f96dc to 6dc5c51
discojs/src/dataset: implement and test repeat and batchWithOverlap
Force-pushed from 6dc5c51 to c477bb3
Force-pushed from 68d957b to 8cbc96e
Closes #654